Evaluations of Implied Orders as a Basis for
Tailored Testing Using Simulations


Norman  Cliff,  Robert  Cudeck,  and  Douglas  McCormick 


The  basic  principle  of  the  Implied  Orders  system  is  a simple  one.  It 
arises  from  considering  dichotomous  items  as  furnishing  ordering  relations 
between  persons  and  items.  If  the  relations  are  consistent  with  each  other, 
then  taken  as  a whole  they  furnish  a joint  ordering  of  the  persons  and  the 
items.  It  is  well  known  that  the  logical  properties  of  an  order  are  such 
that  if  certain  of  the  relations  among  elements  are  known,  then  the  remainder 
can  be  deduced  from  them  by  making  use  of  the  transitivity  property  which 
characterizes orders. This is the principle that underlies our approach
to computer-interactive testing. Its basis is spelled out in a 1975
article (Cliff, 1975). The general idea is that even an incomplete matrix
of responses of persons to items can be used to deduce at least some
order relations between items, that is, their relative difficulty. These
order relations between items in turn can be used to predict what the
individual's responses will be to items not yet taken, and therefore
remove the necessity of asking those items.


Taking a joint order as a model for test items is equivalent to assuming
that the data provide a Guttman scale, but everyone, even we, knows
that test data are not Guttman scales. The idea, however, is that the
joint order is an approximate model for such data. Then the problem in
tailored testing is one of how to modify the transitivity principle to
make it work reasonably well in the presence of error. The approach that
is used here is a rough and ready statistical one. At any given time,
there are a certain number of responses that imply that item j is harder
than k, and a certain number that imply the reverse. If one kind
predominates over the other, then it is implied that one item is easier than
the other. Similarly, the pattern of responses to date by an individual
may be such that a certain number of them imply that he should get a
particular item, which is as yet untaken, wrong; correspondingly, some
others may imply that he should get it right. If one number predominates
over the other, then the implication is made correspondingly.


Procedure 


General  Idea 

Table 1 provides an illustration of the way the procedure operates.
The two columns on the left show the responses of 15 persons to two items,
j and k. To determine which of two items is easier, we examine n_jk, the
number who get j right and k wrong, in comparison to n_kj, the number who
do the reverse. In the data illustrated, person 5 is the only one who
gets j right and k wrong, whereas persons 4, 6, 7, 8, 9, 10 and 11 do the
reverse. The frequencies 1 and 7 (n_jk and n_kj, respectively) are then
shown at the bottom of the table. Use of a statistical decision rule, which
is outlined below, would lead to the decision that k was easier than j. For
each pair of items in a test, such a comparison is made by means of the
decision rule. The results of the comparisons are recorded in what we
call the item dominance matrix. In the matrix, a 1 means that the row
item is harder than the column item.


Table 1

Illustrative Basis of TAILOR Process

                              Complete        Incomplete

Columns in each panel:        Persons, Items (j, k), Dominant Item

[Person-by-person response rows are not legible in this copy.]

Dominance Frequencies
    j                            7                3
    k                            1                0

P                              9/256             1/8
Conclusion                     j > k             j > k

The foregoing is applicable to a complete test. In an incomplete
or tailored test, some of the responses would be missing, as shown in
the two righthand columns of the table. The quantities n_jk and n_kj can
still be counted, however. Since person 5 now has only one item, n_jk = 0.
Of the seven persons who had j wrong and k right, we now have data on
both items from only three, persons 7, 8 and 11, so n_kj is three. The
comparison of n_jk to n_kj could still lead to the conclusion that j was
harder than k if this was all the information that was available,
provided that the liberal rule were used.
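
The counting step described above can be sketched in a few lines. This is a minimal illustration, not the SQUARE subroutine itself: the function name `dominance_counts`, the use of `None` for an untaken item, and the six-person example data are all assumptions made for the sketch.

```python
from typing import Optional, Sequence, Tuple

Response = Optional[int]  # 1 = right, 0 = wrong, None = item not taken


def dominance_counts(item_j: Sequence[Response],
                     item_k: Sequence[Response]) -> Tuple[int, int]:
    """Return (n_jk, n_kj) for two items given to the same persons.

    n_jk counts persons with j right and k wrong; n_kj counts the reverse.
    A person missing either response contributes to neither count.
    """
    n_jk = n_kj = 0
    for rj, rk in zip(item_j, item_k):
        if rj is None or rk is None:
            continue  # incomplete pair: no ordering information
        if rj == 1 and rk == 0:
            n_jk += 1
        elif rj == 0 and rk == 1:
            n_kj += 1
    return n_jk, n_kj


# Hypothetical responses of six persons (not the data of Table 1).
j = [1, 0, 0, 0, 1, None]
k = [1, 1, 1, None, 0, 1]
print(dominance_counts(j, k))  # (1, 2)
```

Persons who pass both items or fail both items carry no information about the order of j and k, so they are skipped in the same way as persons with a missing response.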

Now  consider  person  2.  He  got  the  harder  item  right,  and  has  not 
taken  the  easier  one.  We  could  conclude  that  he  would  get  the  latter 
right  also,  and  would  not  give  it  to  him.  Similarly,  person  13  has  the 
easier  item  wrong,  so  we  could  conclude  that  he  would  get  the  harder  one 
wrong  also  if  he  were  to  take  it,  and  therefore  not  bother  to  give  it 
to him. Actually, what we do in making decisions of this kind is similar
to what is done with respect to deciding which items are easier
and which harder. Suppose person i has not yet taken item j. We look
at the number of harder items he has passed and compare that to the number
of easier items he has failed. If, by the same decision rule used
earlier, the latter of these preponderates over the other, he
correspondingly is implied to have failed item j also; if the reverse, then he
is assumed to have passed it. If neither preponderates, then no decision
is made, and there is no implied response by the person to the item.
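
A minimal sketch of this person-level decision follows. It is not the MULT subroutine: the names `predominates` and `implied_response` are hypothetical, and the decision rule is reduced to the one-tailed binomial comparison at the .33 level described under "Significance Tests", leaving out the special handling of 1,0 frequencies.

```python
from math import comb
from typing import Dict, Optional, Set

ALPHA = 0.33  # one-tailed level quoted in the text


def predominates(a: int, b: int, alpha: float = ALPHA) -> bool:
    """True if count a exceeds count b by a one-tailed binomial test."""
    n = a + b
    if n == 0 or a <= b:
        return False
    # P(X >= a) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, x) for x in range(a, n + 1)) / 2 ** n
    return tail < alpha


def implied_response(responses: Dict[str, int],
                     harder: Set[str],
                     easier: Set[str]) -> Optional[int]:
    """Implied response of one person to one untaken item.

    responses: item -> 1/0 for the items the person has actually taken.
    harder, easier: items currently judged harder / easier than the target.
    Returns 1 (implied pass), 0 (implied fail), or None (no implication).
    """
    passed_harder = sum(1 for it in harder if responses.get(it) == 1)
    failed_easier = sum(1 for it in easier if responses.get(it) == 0)
    if predominates(passed_harder, failed_easier):
        return 1
    if predominates(failed_easier, passed_harder):
        return 0
    return None


# Hypothetical person: passed two harder items, no easier item failed.
print(implied_response({"a": 1, "b": 1}, {"a", "b"}, {"d"}))  # 1
```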

Specific  Operations 

In the program description (Cudeck, et al., Note 1), the frequency-
comparing routines above are part of subroutines SQUARE, which computes
item dominances, and MULT, which computes implied item responses. These
are the major uses, but the comparisons are used in two other places as
well. The binary item dominance matrix resulting from SQUARE is multiplied
by itself to derive "second order" item dominances. That is, suppose not
enough persons have taken both items j and p to yield an implied difficulty
order between them, but j has been found to be harder than k and k
is harder than p. This suggests that j is harder than p. Evidence of
this kind, comparing j to p through all the other items, is used with the
significance test in just the same way that it is in the others. This
is done in the subroutine called IMPLY. There is one final place where
it is used, and this is to help define the order of the persons using the
subroutine COUNT. Here, the implied rights matrix is multiplied by the
implied wrongs (each resulting from MULT) in order to see if person i has
significantly more items right that person h gets wrong than the reverse.
If so, i ranks above h. Thus the frequency-comparison process is used to
establish direct difficulty order relations (SQUARE), indirect difficulty
order relations (IMPLY), implied item responses (MULT), and implied
person-order relations (COUNT).
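
The "second order" step can be illustrated with a small matrix product. This is a sketch under assumptions, not the IMPLY subroutine: the helper names and the four-item dominance matrix are invented for the example, and the counts it produces would still have to be passed through the decision rule.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square-matrix product, adequate for small item pools."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def second_order_counts(dom: Matrix) -> Matrix:
    """Entry [j][p] counts items k with j harder than k and k harder than p."""
    return matmul(dom, dom)


# Hypothetical dominance matrix: 1 means the row item is harder than the
# column item. Items 0 and 3 have no direct relation.
dom = [
    [0, 1, 1, 0],  # item 0 is harder than items 1 and 2
    [0, 0, 0, 1],  # item 1 is harder than item 3
    [0, 0, 0, 1],  # item 2 is harder than item 3
    [0, 0, 0, 0],
]
paths = second_order_counts(dom)
support_for = paths[0][3]      # chains 0 > k > 3
support_against = paths[3][0]  # chains 3 > k > 0
print(support_for, support_against)  # 2 0
```

Here two intermediate items support "item 0 is harder than item 3" and none support the reverse, so the pair of counts 2 and 0 is what the frequency comparison would be applied to.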


"Significance Tests"


The decision rule used in comparing frequencies is a rather liberal
one. It has two parts. The major part corresponds to comparing
frequencies by a binomial probability (McNemar's test; McNemar, 1969) and
rejecting the null with a one-tailed alpha level of .33. Values of n_jk
and n_kj of 2 and 0, or of 3 and 1, thus lead to rejection, and
an implication is made.
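
The binomial comparison can be checked numerically. The sketch below assumes the test is the exact one-tailed sign test on the two discordant counts; the function names are hypothetical. Exact fractions are used so the results can be compared with the P values printed in Table 1.

```python
from fractions import Fraction
from math import comb

ALPHA = Fraction(33, 100)  # one-tailed level quoted in the text


def sign_test_p(larger: int, smaller: int) -> Fraction:
    """Exact P(X >= larger) for X ~ Binomial(larger + smaller, 1/2)."""
    n = larger + smaller
    return Fraction(sum(comb(n, x) for x in range(larger, n + 1)), 2 ** n)


def rejects(larger: int, smaller: int, alpha: Fraction = ALPHA) -> bool:
    """True if the larger count predominates at the given one-tailed level."""
    return larger > smaller and sign_test_p(larger, smaller) < alpha


print(sign_test_p(7, 1))  # 9/256, the complete-data case of Table 1
print(sign_test_p(3, 0))  # 1/8, the incomplete-data case of Table 1
print(rejects(2, 0), rejects(3, 1), rejects(1, 0))  # True True False
```

Frequencies of 2,0 and 3,1 give probabilities of 1/4 and 5/16, both below .33, while 1,0 gives 1/2, which is why the 1,0 case needs the separate rule described next.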

The  second  aspect  of  the  rule  has  to  do  with  instances  where  the 
frequencies  are  1 and  0.  If  the  information  is  very  sparse,  i.e.,  early 
in  the  testing,  most  of  the  frequencies  compared  are  either  0,0  or  1,0, 
and  it  is  necessary  to  get  some  information  out  of  the  latter  in  order  to 
improve item assignments. However, when the information is less sparse,
frequencies  of  at  least  1,0  are  almost  bound  to  occur.  Thus  the  decision 
as  to  whether  to  "believe"  that  a 1,0  frequency  implies,  say,  a difficulty 
order,  depends  on  how  sparse  the  information  is. 

The specific nature of the test involves the evaluation of a chance
probability. Suppose a vector has n elements, n_j of which are 1 and the
remainder zero. Suppose a second n-vector has n_k ones and the rest zero.
If the n_j ones are scattered at random in the first vector and the n_k are
scattered at random in the other, and the two vectors are laid side by
side, what is the probability that none of the ones in one vector are
matched by ones at corresponding places in the other? If the complement
of this probability is found, this is the probability of at least one
pair of ones with the same index in the two vectors, i.e., a frequency
that is not 0,0. Obviously, if n_j + n_k is greater than n, there
must be at least one match. If not, the probability of zero matches
is given by the following formula, where n_j is greater than n_k:


If p(0) is .5 or greater, this implies that a random match is unlikely
to occur, and so the observed 1,0 frequency probably represents real
information, and the implied relation is made. If it is less than .5, the
probability that the matching elements occurred by chance is considered
too great, and no implication is made.
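
Under the random-scatter assumption stated above, the probability of zero matches is hypergeometric: p(0) = C(n - n_j, n_k) / C(n, n_k), which equals the product of (n - n_j - i) / (n - i) for i = 0, ..., n_k - 1. This closed form is derived here from the stated assumptions and may differ in appearance from the expression used in the program. The sketch below, with hypothetical function names, evaluates it and applies the .5 threshold.

```python
from math import comb


def prob_zero_matches(n: int, n_j: int, n_k: int) -> float:
    """P(no position holds a one in both vectors) under random scatter.

    n: vector length; n_j, n_k: number of ones in each vector.
    """
    if n_j + n_k > n:
        return 0.0  # a match is unavoidable
    # Choose the n_k positions of the second vector among the n - n_j
    # positions not occupied by ones of the first vector.
    return comb(n - n_j, n_k) / comb(n, n_k)


def accept_one_zero(n: int, n_j: int, n_k: int) -> bool:
    """Treat an observed 1,0 frequency as real when p(0) is .5 or more."""
    return prob_zero_matches(n, n_j, n_k) >= 0.5


print(round(prob_zero_matches(10, 2, 2), 3))  # 0.622, sparse: accept
print(round(prob_zero_matches(10, 5, 3), 3))  # 0.083, dense: do not accept
```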

These  standards  will  clearly  seem  incautious  to  anyone  raised  in  the 
.05 - .01 tradition of significance testing. Two things may be borne in
mind. First, most of these implications of order are subject to reversal on
the basis of later evidence, so the decisions are not irrevocable. Second,
we  do  not  have  the  same  payoff  matrix  here  as  underlies  traditional 
hypothesis  testing.  Particularly  at  the  early  stages,  the  penalty  for 
concluding  that  there  is  not  a difference  in  difficulty  when  there  is  one  is 
as  large  as  the  penalty  for  concluding  that  there  is  a difference  when  there 
is  not.  We  were,  in  fact,  forced  to  abandon  the  use  of  more  traditional 
significance levels by a good deal of exploratory simulation work, and it
was  not  until  we  adopted  this  mode  that  we  began  to  get  reasonably  good 
results. 

Assigning  Items  to  Persons 

At any given time in the testing process, as many inferences as
possible are made about the relative difficulty of the items. These in turn
are used to imply responses for each person to items he has not yet taken,
and these in turn are used to help determine the joint order of persons
and items. The latter is necessary for the purpose of assigning items
to persons optimally, insofar as current information allows, during
the course of the interactive testing.

A total score is assigned to each item and each person. For a person,
this is the total number of items which he gets right, either directly
or by implication (from MULT), plus the number of persons he ranks
above or dominates (from COUNT), minus the number of items and persons that
dominate him, as derived from the same sources. For an item, this total
score is the number of persons who get it wrong (directly or indirectly,
as determined from MULT) plus the number of items it is harder than,
determined from SQUARE and IMPLY, minus the number of persons who get
it right and the number of items it is easier than, as determined from
the same sources. In this way, the persons and the items are ranked on
the same scale.

The  person  takes  the  item  for  which  he  has  no  direct  or  implied 
response  and  which  lies  closest  to  him  on  the  joint  scale. 
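
The scoring and assignment rules above can be sketched as follows. The function names, argument names, and example numbers are hypothetical, and the tie-breaking (first candidate in dictionary order) is an assumption, since the text does not say how ties in distance are resolved.

```python
from typing import Dict, Optional, Set


def person_score(items_right: int, items_wrong: int,
                 persons_below: int, persons_above: int) -> int:
    """Items passed plus persons dominated, minus items and persons
    that dominate the person (direct or implied)."""
    return items_right + persons_below - items_wrong - persons_above


def item_score(persons_wrong: int, persons_right: int,
               items_easier: int, items_harder: int) -> int:
    """Persons failing the item plus items it is harder than, minus
    persons passing it and items it is easier than."""
    return persons_wrong + items_easier - persons_right - items_harder


def next_item(p_score: int, item_scores: Dict[str, int],
              answered: Set[str]) -> Optional[str]:
    """Unanswered item (no direct or implied response) closest to the
    person on the joint scale; None when nothing is left."""
    candidates = [it for it in item_scores if it not in answered]
    if not candidates:
        return None
    return min(candidates, key=lambda it: abs(item_scores[it] - p_score))


# Hypothetical joint-scale positions.
scores = {"a": -3, "b": 1, "c": 4}
print(next_item(2, scores, answered={"b"}))  # c
```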

Implementation 

There are two basic modes of operation, which might be called
simultaneous and cumulative. The simultaneous one, TAILOR, which was
developed first (Cudeck, Cliff, Reynolds and McCormick, Note 1; Cudeck,
Cliff, and Kehoe, in press), assumes that a group of people is taking
the test at the same time, whereas the cumulative one, TAILOR-APL, assumes
that subjects are tested individually.

In TAILOR, testing takes place by what might be called rounds. At
each round, each person receives an item unless he has completed the test.
Assignment of items to persons takes place at random for the first round,
and in subsequent rounds each person is assigned the item that is closest
to him in the joint scale whose determination was described above. There
is, however, a restriction on the number of persons that can be assigned
a particular item on a particular round.

This  procedure  is  illustrated  in  Figure  1 where  the  first  three  panels 
show  the  operation  at  an  early  stage  of  the  process.  The  data  are  for  25 
persons  and  15  items  on  the  Stanford-Binet.  The  left  one  is  the  actual 
response  matrix;  the  middle  one  is  the  item  dominance  relations  that  are 
implied  by  them,  and  the  right  one  is  the  implied  response  matrix.  In  each, 
a 1 means  correct  or  dominance,  a zero  means  incorrect  or  antidominance, 
and  a blank  means  no  relation.  The  middle  set  of  panels  shows  the  operation 
at  an  intermediate  stage  of  the  testing  process,  and  the  bottom  one  shows 
the final stage, the rightmost panel showing that the score matrix is now
complete by implication.

The  second  mode  of  operation  is  a sequential  or  cumulative  one  which 
tests individual subjects only. This is called TAILOR-APL (McCormick and
Cliff, in press). Again, no knowledge about the items is assumed. The
first person must therefore take all items. After a few persons, however,
there may be enough information to define the relative difficulty of some
items. Insofar as this is the case, it is used for subsequent persons to
imply their responses to some items. As more and more information
accumulates, more relative difficulty relations also accumulate.

It is TAILOR which is evaluated in the two simulation studies reported
here, while TAILOR-APL was evaluated with live subjects and will be reported
separately.


Monte  Carlo  Study 


Two  Types  of  Simulation 


Considerable development and testing went into these programs
as workable operationalizations of the general concepts
were sought. The TAILOR program seems to represent a satisfactory
combination of characteristics, and a series of simulation studies was undertaken
in order to assess its actual performance and its sensitivity to various
parameters of the items and the testing situation.

Two types of simulation were done. The more extensive was carried out
as a Monte Carlo process where the outcome of giving a particular item to
a particular person was determined by calculating the probability of
correct response according to the four-parameter Birnbaum model (Birnbaum, 1968)
and comparing a random number on the 0,1 interval to that probability. The
results of this study are the main focus of this report. Responses from
human subjects were also used. A file of responses of 622 children, ages
2 years 0 months to 14 years 11 months, to the Stanford-Binet provided the
data; correctness of a person's response to an assigned item was determined
by looking in the file of responses.^ The results of this study will be
given in a later section.


Parameters  Studied  in  the  Monte  Carlo  Study 

In the simulation study, true score was defined to be normally
distributed in the population with mean zero and standard deviation one. A sample
of predetermined size was drawn from this population. Item difficulty was
also assumed to be normally distributed, but the population mean and
standard deviation could be varied as a program parameter.

This enabled us to study the effect of a mismatch between difficulty and
true score. A predetermined number of items was sampled from the population
with these characteristics. Thus, rather than being constrained to have the
same mean and variance in the sample, item difficulty statistics could
deviate as a result of either sampling fluctuations or deliberate manipulation
of parameters. This was felt to introduce an element of realism that would
be missing if difficulty and true score distribution parameters were
constrained to be equal.

^We wish to express our appreciation to Dr. Mark Reckase of the University
of Missouri for providing this data file.

Discrimination values of the simulated items were also sampled from
populations whose parameters could be varied. Again, a normal distribution
was assumed for the discrimination parameters. In this way the effect of
varying discrimination could be studied.

The  guessing  probability  was  also  varied  so  that  its  effect  could  be 
assessed.  It  was  fixed  at  a particular  constant  value  rather  than  sampled 
from a population, however. The number of items in the pool and the number
of persons being tested were also varied.

With this number of variables it was not possible to vary all of them
simultaneously or over a wide range of values. Instead, two or three values
of each were selected on the basis of realism and practicality, and small
factorial designs involving two or three of them were constructed. In this
way, the main effects and certain first and second-order interactions could
be studied, but higher order interactions could not. The combinations selected
were chosen on the basis of their expected importance, and it is hoped that
the principal effects were thus identified.


The analyses of the effects of these variables were carried out as
analyses of variance and analyses of covariance, with the correlation of
true score with observed score in the complete-data case for the same data
being used as the single covariate. This permitted the assessment of the
degree to which there were effects over and above that of the basic
consistency of the data. In addition to the numerous small analyses, a
regression analysis with all the data combined into a single score matrix
and the main effects as independent variables was carried out, both with
the covariate and without it.

The combinations of parameter values used here are given in Table 2.
The assumed number of subjects was 10, 25 or 40, with the great majority
of the data coming from the latter two values. The number of items was
assumed to be either 15 or 25. These rather small numbers of persons and
items were chosen for reasons of economy. Mean discrimination was assumed
at either 1.0 or 2.0 except in two cases where it was .5. The standard
deviation of discrimination was usually assumed to be zero, but also took
on values of .2 or .4. The mean population difficulty was usually assumed
to be zero, equal to the mean ability, but was also set at 1.0 for two runs.
Similarly, the standard deviation of difficulty was usually 1.0, but set
at 2.0 for two runs. The chance parameter was usually zero, but also took
on values of .1 or .2. The basic design was a 2 x 2 x 2 with 25 or 40
persons, 15 or 25 items, and discrimination of 1.0 or 2.0. The other
parameter variations were usually made singly in combination with different levels
of one or two of these main parameters. The specific combinations used can
be seen in Table 2, where each row defines a set of conditions for a sample
of score matrices generated by the Birnbaum model. The correlation matrix
of major variables appears in the appendix, Table A.


Table 2

Characteristics of Samples of Score Matrices
Generated by Latent Trait Models

Sample Characteristics

Condition                        Mean
Number      Persons    Items     Discrimination

    1         10         25         1.0
    2         10         25         2.0
    3         25         15          .5
    4         25         15          .5
    5         25         15         1.0
    6         25         15         2.0
    7         25         15         1.0
    8         25         15         2.0
    9         25         15         1.0
   10         25         15         2.0
   11         25         15         1.0
   12         25         15         2.0
   13         25         25         1.0
   14         25         25         2.0
   15         25         25         1.0
   16         25         25         2.0
   17         25         25         1.0
   18         25         25         2.0
   19         25         25         1.0
   20         25         25         2.0
   21         40         15         1.0
   22         40         15         2.0
   23         40         15         1.0
   24         40         15         2.0
   25         40         25         1.0
   26         40         25         2.0
   27         25         25         2.0
   28         25         15         2.0

[The remaining columns (standard deviation of discrimination, mean
difficulty, standard deviation of difficulty, and chance) are not
legible in this copy.]

Data  Generation 

Five sample score matrices were generated according to the parameters of
each line of Table 2. Given the person scores and the item parameters, a
probability of correct response of each person to each item is computable using
the Birnbaum model. Then a random number determined whether the response was
correct or not. The resulting score matrices were stored for later reference
by TAILOR. The specific procedures are in Appendix 1.
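
The generation step can be sketched as below. The exact response function is given in Appendix 1, which is not reproduced here, so this sketch assumes a logistic form with discrimination a, difficulty b, and a constant chance level c, and omits any scaling constant; the function names and default values are illustrative only.

```python
import math
import random
from typing import List, Optional


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Assumed logistic response function with a chance floor c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def generate_score_matrix(n_persons: int, n_items: int,
                          mean_a: float = 1.0, sd_a: float = 0.0,
                          mean_b: float = 0.0, sd_b: float = 1.0,
                          c: float = 0.0,
                          seed: Optional[int] = None) -> List[List[int]]:
    """Persons-by-items matrix of 0/1 responses.

    True scores are drawn from N(0, 1); item discrimination and
    difficulty are drawn from normal populations with the given
    parameters; c is a fixed constant, as in the text.
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    items = [(rng.gauss(mean_a, sd_a), rng.gauss(mean_b, sd_b))
             for _ in range(n_items)]
    # A response is correct when a uniform random number falls below
    # the model probability.
    return [[1 if rng.random() < p_correct(t, a, b, c) else 0
             for (a, b) in items]
            for t in thetas]


matrix = generate_score_matrix(25, 15, mean_a=2.0, seed=1)
print(len(matrix), len(matrix[0]))  # 25 15
```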

The initial round of assigning each person an item took place at random.
The correctness of the response was determined by simply looking in the
appropriate location of the stored matrix of model-generated item responses. For
each subsequent round, the matching of item with person took place by means
of the computation of implied responses and matching of item-person scores, as
was outlined earlier. The session was complete when there was an actual or
implied response for each item and person in the score matrix.

In addition to the true score that was used to generate the data, there
are two scores for each person in a given sample. One is his score from the
tailored simulation, and the other is his score on the complete test. These
will be referred to as Tailored scores and Complete scores, respectively.
Note that they are not experimentally independent because the responses on
which the Tailored score is determined are a subset of the responses
determining Total score. A number of statistics were calculated from these, and
some of them were important to the later analysis.

Dependent  Variables 


A variety of dependent variables were used, but not all in all analyses.
The correlations among True score, Tailored score and Total score were
computed. The two correlations with True score are the validities of
Tailored and Total scores. These correlations were used as a major
dependent variable and a covariate, respectively. The reasoning was that
agreement with true scores is the major quality which one desires from
a tailored testing procedure, and that variations in Total score validity
were expected to be the major source of influences which would
need to be controlled. Both Pearson and rank-order (tau) coefficients
were computed, but the results of the analyses were in extremely close
agreement, so only one set will be reported, the tau. The Fisher
Z-transform was not used because in our experience it has little effect
on results unless the mean and variance of the correlations are both
large, and rarely even then.
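
The two validity coefficients can be computed as in the sketch below. The text does not say which variant of tau was used; the tie-corrected tau-b is assumed here because number-right scores contain many ties. The function names and the four-person example are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank-order correlation with a correction for ties (tau-b)."""
    conc = disc = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                conc += 1
            elif dx * dy < 0:
                disc += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (conc - disc) / denom


# Hypothetical true scores and tailored scores for four persons.
true_scores = [1.0, 2.0, 3.0, 4.0]
tailored = [2, 1, 4, 3]
print(round(kendall_tau_b(true_scores, tailored), 3))  # 0.333
print(round(pearson_r(true_scores, tailored), 3))      # 0.6
```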

An  additional  variable  reflecting  the  accuracy  of  the  operation  of 
TAILOR  is  the  proportion  of  responses  to  items  that  were  not  taken  which 
were  correctly  predicted  by  TAILOR.  Other  variables  which  were  used  re- 
flected efficiency  and  cost  of  the  procedure.  In  particular,  the  ratio  of  | 

actual  responses  to  total  possible  responses  in  a given  score  matrix  is 
clearly  relevant,  as  is  the  amount  of  computer  processing  time  used  per  run. 

The  effects  of  the  independent  variables  on  these  were  assessed  as  well  as 
determining  the  overall  levels. 

A Monte Carlo evaluation of this kind with multiple dependent and
independent variables provides opportunity for many analyses. Not all
possible ones were performed. Rather, the attempt was to be selective and
investigate the most important questions and assess the plausibility of the
most reasonable alternative explanations for any effects that were present.


Results


Basic  Results 

While the major results are included in Table 3, particular
portions of them will be selected out for further analysis and comment.
The data there are listed in the same order of conditions as in Table 2.
Here, the correlations, both Pearson and tau, of total score on the
complete test with true score are given. Also given are the correlations
with score from the TAILOR operation on the same data. The percentage
of responses under TAILOR and the amount of CPU time used are likewise
presented. The last line of the table shows the means of each variable.

The means represent averages across conditions which affect these
variables, and some of the effects are substantial. Therefore, their
precise values should be viewed rather tentatively. They do, however,
provide a quick summary of the major findings. On the average, TAILOR
presented 55 per cent of the items to each person. The validity
correlations for the tailored scores averaged .757 (tau) and .889 (r).
These are similar to the complete-test validities of .810 and .926.


Influences  on  Proportion  of  Responses 

Fifty-five percent of the items is rather a large proportion, if
one is hoping to make material increases in test efficiency. Thus the
factors that seem to influence the proportion of items in the test
bank which must be presented to the person are of interest. The major
ones appear to be the number of items and persons. The more items in
the pool and the more persons being tested, the smaller the proportion of items
Table 3

Cell Means of Major Dependent Variables

Condition    Complete Data       Tailored Data       %            CPU
Number       Tau     Pearson     Tau     Pearson     Responses    (per person)

    1        .769    .941        .724    .850        .590         3.64
    2        .880    .955        .846    .946        .577         3.54
    3        .745    .876        .636    .810        .615          .54
    4        .621    .803        .480    .686        .556          .58
    5        .803    .920        .750    .896        .609         1.34
    6        .860    .954        .818    .934        .580         1.14
    7        .771    .916        .734    .898        .599         1.38
    8        .863    .932        .822    .922        .577         1.22
    9        .662    .869        .600    .810        .552         1.11
   10        .865    .958        .854    .952        .571         1.16
   11        .771    .924        .728    .880        .603         1.32
   12        .863    .938        .836    .928        .592         1.36
   13        .809    .924        .738    .878        .514         3.62
   14        .868    .965        .884    .946        .516         3.59
   15        .826    .947        .754    .904        .538         3.99
   16        .836    .946        .824    .920        .506         3.82
   17        .788    .918        .696    .866        .555         4.70
   18        .821    .913        .760    .896        .502         3.84
   19        .821    .945        .758    .884        .506         3.38
   20        .890    .968        .864    .954        .490         3.53
   21        .766    .904        .710    .868        .543         1.44
   22        .855    .932        .836    .924        .548         1.41
   23        .682    .843        .560    .734        .567         1.57
   24        .773    .907        .706    .870        .571         1.54
   25        .842    .956        .772    .920        .471         4.06
   26        .892    .956        .852    .940        .441         3.14
   27        .868    .962        .846    .948        .492         3.60
   28        .836    .940        .814    .924        .585         1.23
the item pool, the better a portion can stand for the whole. The strength
of the effect of the number of persons is perhaps surprising, but it may
be remembered that persons and items are treated symmetrically in the
present theory. Another way to look at this is that more persons enable the
program to get clear information on item difficulties more quickly. There
is perhaps also a tendency for fewer responses to be required when the
items are more discriminating.

These  results  are  shown  more  clearly  in  Table  4.  The  upper  part 
shows  the  mean  proportion  of  responses  for  the  2x2x2  subset  of  the  study 
which  was  mentioned  earlier.  Here,  the  strong  effect  of  Items  and  Persons 
is  quite  apparent,  but  the  effect  of  Discrimination  is  weak  or  non-existent. 
For the 25-item, 40-person cell, the mean proportion of responses is .456.
The lower section of the table shows the results of a regression analysis
of all 28 conditions (n = 140) of proportion of responses on all three
variables treated as main effects. There, the F column shows the
significance of the regression weight for the individual variables, and Items
and Persons are clearly significant but Discrimination is not.

The  values  used  for  these  variables  are  limited  by  the  practical 
considerations  present  in  this  study,  and  it  would  be  interesting  to  explore 
a wider  range  of  values.  In  particular,  there  must  be  diminishing  returns 
in  the  effect  of  Persons  and  Items,  but  these  data  do  not  permit  the 
assessment  of  the  rate  at  which  that  takes  place.  Also,  there  should  be 
a fairly  strong  effect  of  Discrimination  if  it  reaches  really  low  values 
since  this  would  lead  to  a greater  frequency  of  contradictory  dominance 
relations  with  concomitant  failure  to  pass  the  "significance"  levels. 


Table 4

Influences on Proportion of Items

Proportion of Items Used in 2 x 2 x 2 Data Subset

              10 Persons         25 Persons               40 Persons
                 Items              Items                    Items
                  25          15      25     Mean       15      25     Mean

Discr. = 1       .590        .609    .514    .562      .543    .471    .507
Discr. = 2       .577        .580    .516    .548      .548    .441    .494
Mean             .584        .594    .515    .555      .546    .456    .501

n = 2     n = 20     n = 6


Regression of Proportion of Items on Persons, Items, and Discrimination
All Data

Variable                     b        beta        F       Cum. R²

Items                     -.0076     -.690     128.05      .327
Persons                   -.0036     -.459      56.69      .524
Average Discrimination    -.0101     -.092       2.38      .532
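
As a rough check on how the regression weights in Table 4 relate to the cell
means in its upper part, the following sketch multiplies each b weight by a
change in level and compares the result with the corresponding difference
in means. It assumes a main-effects-only reading of the regression; the
function and variable names are illustrative and not part of TAILOR. Note
that the regression is based on all 28 conditions while the means come from
the subset, so only approximate agreement should be expected.

```python
# Unstandardized weights (b) copied from the regression section of Table 4.
B_ITEMS = -0.0076    # change in proportion of items used per additional item
B_PERSONS = -0.0036  # change in proportion of items used per additional person


def predicted_change(b, old_level, new_level):
    """Predicted change in the proportion of items used when one variable
    moves from old_level to new_level (main effects only, others fixed)."""
    return b * (new_level - old_level)


# Differences between means read from the upper part of Table 4.
observed_items_change = 0.515 - 0.594    # 25 persons: 15 items -> 25 items
observed_persons_change = 0.501 - 0.555  # row means: 25 persons -> 40 persons

items_change = predicted_change(B_ITEMS, 15, 25)
persons_change = predicted_change(B_PERSONS, 25, 40)

print(f"Items 15 -> 25:   predicted {items_change:+.3f}, "
      f"observed {observed_items_change:+.3f}")
print(f"Persons 25 -> 40: predicted {persons_change:+.3f}, "
      f"observed {observed_persons_change:+.3f}")
```

Under these assumptions the predicted and observed changes agree to within
a few thousandths, which is consistent with the statement that Items and
Persons act essentially as main effects.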
Validity of Tailored Scores

The success of a tailored testing scheme is primarily measured by
the ability of the tailored scores to substitute for scores on the
complete test. In a Monte Carlo study like this, where a true score is
available, correlation with true score would appear to be the most
appropriate criterion by which to judge the success of TAILOR.

As noted earlier, the Tailored validities are close to, but lower
than, the complete data validities. The validities are also apparently,
from Table 3, quite variable, ostensibly as a function of various
parameters of the situation. A number of analyses were made to attempt to
identify the characteristics which affect Tailored validity.

The primary effect is from the consistency of the actual sample of
data. This is illustrated graphically in Figures 2 and 3, which plot
mean Tailored correlation as a function of mean complete correlation
for each of the 28 conditions. Figure 2 shows tau and Figure 3, r.
Treating each of the five replications under each condition separately,
the correlations of complete and Tailored validities are .86 and .83 for
tau and r, respectively. The slopes of the regression lines in the Figures
are greater than unity, showing that a given effect on the validity of
a complete test will have an even greater effect on the validity of the
items in Tailored format.

Influences of Independent Variables on Validity

A variety of regression analyses and analyses of variance and
covariance were performed to identify influences on validity. Tailored
Tau, the rank-order correlation of Tailored score with True score, was
the dependent variable in most of these analyses. Some of the analyses
were duplicated with both dependent variables (tau and r), and the results
were never in conflict. We focus on tau for two reasons. One is the ordinal
emphasis of our approach, which makes the rank-order agreement more
appropriate. The other is of a more empirical nature. There is more
variability in tau, and it is farther from the limit, whereas the range of
r is up toward high values where variability is limited. Furthermore, tau
showed slightly greater sensitivity to the independent variables, viz. the
.86 vs. .83 correlations with Complete Test validity.
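
To make the distinction between the two validity indices concrete, the
sketch below computes a rank-order tau and a Pearson r for the same small
set of scores. It is an illustration only: the tau-a formula is an
assumption (the report does not say which variant of tau was used), and the
scores are made-up values rather than data from the study.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Pairs tied on either variable count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def pearson_r(x, y):
    """Ordinary product-moment correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical true scores and tailored scores for five persons.
true_score = [-1.2, -0.4, 0.1, 0.8, 1.5]
tailored_score = [3, 5, 4, 9, 12]

tau = kendall_tau_a(true_score, tailored_score)
r = pearson_r(true_score, tailored_score)
print(f"tau = {tau:.3f}, r = {r:.3f}")
```

In this toy example one inverted pair lowers tau noticeably while r stays
above .9, which illustrates why tau leaves more room for variation among
conditions.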

Regression analyses based on the values of the parameters in Table 2
for all 28 conditions were performed, treating Tailored Tau as the
dependent variable. Analyses of variance of subsets of the data were also
performed, as were analyses of covariance treating Complete Tau as the
covariate. These showed only one significant interaction, so the regression
analysis, which includes the main effects of all the independent variables,
provides a valid summary. The analyses of variance and covariance will be
presented also, however.

The validity of the Tailored score should clearly be dependent on the
validity of the data on which it is based. The latter is represented here
most clearly as the correlation of the Complete data score with the True
score. Here, this correlation is a function of three manipulated variables
and a random effect which is dependent on the actual sampling of items
and the stochastic nature of the response. The three manipulated variables
are the mean discrimination index for the population of items, the
probability of chance success in the model, and the number of items.
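
The way these three manipulated variables enter a simulated data set can be
sketched as follows. This is a minimal illustration, not the program used
in the study: it assumes a three-parameter logistic response curve without
a scaling constant, a common discrimination value for all items, and
standard normal true scores and difficulties. All function names are
hypothetical.

```python
import math
import random


def p_correct(theta, a, b, c=0.0):
    """Assumed response curve: chance level c plus a logistic function of
    discrimination a times the distance between ability and difficulty."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(n_persons, n_items, mean_a, chance, seed=0):
    """Return true scores, item difficulties, and a 0/1 response matrix."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    difficulties = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    matrix = [
        [1 if rng.random() < p_correct(t, mean_a, b, chance) else 0
         for b in difficulties]
        for t in thetas
    ]
    return thetas, difficulties, matrix


# One simulated condition: 25 persons, 15 items, discrimination 1.0,
# no chance success.
thetas, difficulties, matrix = simulate_responses(25, 15, 1.0, 0.0, seed=1)
complete_score = [sum(row) for row in matrix]
print(complete_score)
```

Raising `mean_a` sharpens the response curves, raising `chance` adds
correct answers that are unrelated to ability, and raising `n_items`
lengthens each row; each of these changes how closely the complete score
can track the true score.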


Table 5 shows the results of stepwise regression analyses of Tailored
Tau on predictors. The analyses were performed using the SPSS package
(Nie, Hull, Jenkins, Steinbrenner, and Bent, 1975). The upper section shows
the results when Complete Tau is included, and the lower when it is not.
The "Significant Variables" are those which significantly increased the
multiple R as they were included. After they were included, none of the
remaining ones were shown as adding significantly to the multiple R. (There
is a possibility with such a procedure that some combination of the
not-included variables might have added significantly, but this possibility
is remote and often unstable when it occurs.)

Complete Tau was the first variable entered in the regression equation,
accounting for 74 per cent of the variance of all 140 observations.
However, two of its causes, Mean Discrimination and Chance Probability,
added significantly to the multiple R, contributing five and two per cent
of variance, respectively. All three have weights of the expected sign:
higher Complete Tau and higher Mean Discrimination lead to higher Tailored
Tau, and greater Chance Probability leads to lower.

When only the manipulated variables are included, the third influence
on Complete test validity also becomes significant. Now, Mean
Discrimination accounts for almost half the variance (49%); Chance
Probability contributes another eight, and the number of items a final five
per cent. The total percentage of variance accounted for is 62 rather than
82 as it was when the consistency of the actual data set was included. The
number of persons, the variability in discrimination of the items, and the
mean and standard deviation of difficulty are not significant influences
on validity with the levels of parameters used here. The level of
precision here (witness the small proportion of variance for Items) is such
that these variables must only have small effects, if any. The lack of
negative effects for numbers of persons and items is particularly
interesting in view of the fact that a smaller proportion of responses is
used as these increase.

Table 5

Variables                                  b        beta        F      Cum. R²

I. Including Complete Tau

Significant Variables:
  Complete Tau                           .8124     .6400     183.64     .743
  Mean Discrimination                    .0585     .3270      49.32     .796
  Chance Probability                    -.2366    -.1525      16.15     .818

Nonsignificant Variables:
  Items
  Persons
  Standard Deviation of Discrimination
  Standard Deviation of Difficulty
  Mean Difficulty

II. Manipulated Variables Only

Significant Variables:
  Mean Discrimination                    .1215     .6794     162.94     .490
  Chance Probability                    -.4910    -.3166      35.68     .572
  Items                                  .0053     .2362      19.39     .625

Nonsignificant Variables:
  Persons
  Standard Deviation of Discrimination
  Standard Deviation of Difficulty
  Mean Difficulty
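
The percentages quoted in the text can be recovered from the last column of
Table 5 if that column is read as the cumulative proportion of variance
after each variable is entered. The sketch below makes that reading
explicit; the interpretation of the column and the helper name are
assumptions made for illustration.

```python
def step_gains(cumulative):
    """Convert cumulative proportions of variance into the gain at each
    step of the stepwise regression."""
    gains = []
    previous = 0.0
    for value in cumulative:
        gains.append(round(value - previous, 3))
        previous = value
    return gains


# Last column of Table 5, in order of entry.
with_complete_tau = [("Complete Tau", 0.743),
                     ("Mean Discrimination", 0.796),
                     ("Chance Probability", 0.818)]
manipulated_only = [("Mean Discrimination", 0.490),
                    ("Chance Probability", 0.572),
                    ("Items", 0.625)]

for label, rows in (("I. Including Complete Tau", with_complete_tau),
                    ("II. Manipulated Variables Only", manipulated_only)):
    gains = step_gains([value for _, value in rows])
    print(label)
    for (name, _), gain in zip(rows, gains):
        print(f"  {name:<20s} adds {100 * gain:4.1f} per cent")
```

Read this way, the first analysis gives roughly 74, 5, and 2 per cent, and
the second roughly 49, 8, and 5 per cent, matching the figures in the text.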

The 28 conditions contain a number of subdivisions which represent
orthogonal designs involving two or three of the manipulated variables.
These allowed the investigation of a number of first-order and a few
second-order interactions among the variables as well as the highlighting
of the main effects. That is, there are a number of small factorial designs
contained in the total set; the other variables are held constant at some
particular combination of levels.

One such subdesign is shown in Table 6. The analysis of variance
(anova) and the analysis of covariance (ancova) show that there is an
effect for interaction. Even when corrected for the Complete Tau as a
covariate, changing the discrimination index from 1.0 to 2.0 will increase
the Tailored Tau by about .06. Not considering the matrix consistency,
i.e., in practice, the actual effect of changed discrimination would be to
increase validity by about .10.

Two combinations of conditions cross Mean Discrimination with Chance
Probability. One varied Chance Probability at the levels 0, .1, .2 and
Mean Discrimination at 1.0 or 2.0 with 25 Items and 25 Persons. The other
used only the two extreme levels of Chance (0 and .2), used the same


Table 6

     Mean Discrimination = 1               Mean Discrimination = 2

               Persons                               Persons
     Items     25      40                  Items     25      40

      15      .750    .710                  15      .818    .836
      25      .738    .772                  25      .884    .852

F_P = .09;  F_I = 3.74;  F_D = 37.87;  All interactions ...

^a S. D. Discrimination = 0; S. D. Difficulty = 1.0; Chance Probability = 0;
Mean Difficulty = 0.


Table 7

Effect of Discrimination and Chance
Probability on Validity^a

     Persons = 25, Items = 25              Persons = 40, Items = 15

          Chance Probability                  Chance Probability
Discrim.   0.0     0.1     0.2          Discrim.   0.0     0.2

  1.0     .738    .754    .696            1.0     .710    .560
  2.0     .884    .824    .760            2.0     .836    .706

F_D = 14.19;  F_C = 4.01                F_D = 17.69;  F_C = 18.75

^a S. D. Discrimination = 0; S. D. Difficulty = 1.0.

significant interaction, by analysis of variance. In the 2x3 case, Mean
Discrimination is significant at .01 and Chance at .05. In the other, both
are significant at .01. Slight differences in the relative sizes of the
effects in the two cases lead to different significant effects when the
covariate is used. In the 2x3 case, Mean Discrimination is significant
at .01 while Chance is not significant (although F = 1.94). In the 2x2
data, the situation is reversed. Chance Probability is significant at .05,
whereas Mean Discrimination is not (although F = 3.59). It seems likely
that both have a real effect (cf. the regression analysis) but that random
fluctuations of small magnitude led to the failure of one to reach
significance in one case while the other failed in the second. The use of
only the extreme levels of Chance Probability in the second case may have
contributed. The alternative is that there is a high-level interaction with
Items or Persons or both. Unfortunately, the choice of levels confounds the
persons and items in these data. Full explication would require using a
four-variable factorial, which did not seem justified. If these
uncertainties are set aside, it appears that items with an inherent
discrimination parameter of 1.0 and a zero guessing probability will behave
about as well as items with a discrimination parameter of 2.0 and a
guessing probability of .2, i.e., five-choice items.

The effect of discrimination indices when they vary over a wider range
is shown in Table 8, which includes a .5 discrimination index as well as
1.0 and 2.0. Note that all other variables are fixed at certain levels, of
which the most important is that Chance Probability is zero. Also, the
lower levels of Persons and Items are used. Lowering discrimination from
1.0 to .5 has an even stronger effect here than the step from 2.0 to 1.0,
Table 8

Effect of Lowest Mean
Discrimination on Validity^a

        Mean Discrimination

      .5        1.0        2.0

     .636      .750       .818

F_D = 11.91

^a Items = 15; Chance Probability = 0; S. D. Difficulty = 1.0;
Mean Difficulty = 0; S. D. Discrimination = 0.


Table 9

Effect of Lowest Levels of
Persons on Validity^a

    Mean                  Number of Persons
Discrimination       10       25       40      Mean

     1.0            .724     .738     .772     .745
     2.0            .846     .884     .852     .861
     Mean           .785     .811     .812     .803

F_D = 26.62;  F_P = .62

^a Items = 25; Chance Probability = 0; S. D. Difficulty = 1.0;
Mean Difficulty = 0; S. D. Discrimination = 0.

apparently, although the latter difference happens to be smaller here than
it is in most of the data. These differences remain, even increase slightly,
in the analysis of covariance, although there the significance level just
misses the .01 criterion (R² = .67) whereas, rather anomalously, this is
reached in the analysis of variance.

As  an  experiment  in  the  robustness  of  the  system,  the  number  of  persons 
was  reduced  all  the  way  to  ten  in  the  next  set  of  data.  As  shown  in  Table  4, 
the  proportion  of  items  is  somewhat  increased,  but  is  still  below  60  per 
cent  with  25  items.  Table  9 shows  that  there  is  essentially  no  effect  on 
validity, and the analyses of variance and covariance confirmed this; both
Persons  and  Persons  by  Discrimination  interaction  were  far  from  significance. 
There  is  presumably  some  effect  on  efficiency,  because  more  items  are  asked, 
but  this  balances  out  to  give  equivalent  validity  for  the  final  result. 
Discrimination  has  its  usual  strong  effect  here.  The  important  thing  is 
that  the  TAILOR  system  will  apparently  operate  adequately  with  as  few  as 
ten  persons,  provided  item  discrimination  reaches  these  acceptable  levels 
and  the  guessing  probability  is  negligible. 

Monte Carlo studies typically assume a constant value for the
discrimination index, whereas in actuality items are variable in their
discriminatory power. The next set of data investigated the effect of
variable discriminatory power (S. D. Discrimination) on the validity of
TAILOR scores. Standard deviations of .2 and .4 were used, the latter only
with Mean Discrimination of 2.0. Thus the manipulations were relatively
mild since all items sampled would still have substantial positive
discrimination. With these levels, there is no effect whatever, as can be
seen in Table 10. The analyses of variance and covariance confirmed this.
Because of the unbalanced nature


Table 10

Effect of Variation in Discrimination,
Mean Discrimination, and Number of
Items on Validity^a

            Mean Discrimination = 1.0          Mean Discrimination = 2.0
              S. D. Discrimination               S. D. Discrimination
Items       0      .2      .4     Mean         0      .2      .4     Mean

 15       .750    .728     --     .739       .818    .836    .814    .827
 25       .738    .758     --     .748       .884    .864    .846    .865
Mean      .744    .743     --     .744       .851    .850    .830    .844

^a Persons = 25; S. D. Difficulty = 1.0; Chance Probability = 0;
Mean Difficulty = 0.

of the design, two analyses were done, one of the 2x3 with Mean
Discrimination of 2.0, on the right, and one of the 2x2x2 which is left
when S. D. Discrimination of .4 is deleted. All show tiny mean squares for
S. D. Discrimination. The usual effect of Mean Discrimination is present,
and there is a .05 level main effect for items in the anova which washes
out in the ancova. No interactions approach significance. Thus, the TAILOR
procedure is not sensitive to mild variations in the discriminating power
of the items in the pool being used.

The final set of analyses attempted to assess the effect of mismatches
of the item difficulty for the population tested. In almost all the data
used here, the item difficulty was normally distributed with a mean of 0
and variance 1.0, i.e., difficulty has the same distribution as true score.
The actual items simulated were sampled from these populations, so they
would have characteristics that deviated from the population values due
to sampling. These last sets of data dealt with items which deliberately
deviated. In particular, items from a population with a mean difficulty
of .5 (and a variance of 1.0) were sampled. These data are in the left
section of Table 11. Additionally, a second set with a difficulty
standard deviation of 2.0 was sampled.

The moderate deviation in mean difficulty had no significant effect
in either the anova or the ancova. On the other hand, Table 11 shows
appreciable differences for S. D. Difficulty, primarily as an interaction
with Mean Discrimination. The main effect is not significant (F = 2.86),
but the interaction is significant at .01, indicating that items which
are highly variable in difficulty work better than moderately variable
ones when discrimination is high, but the reverse is true when
discrimination is merely good. This effect disappears in the ancova.
Apparently it is virtually entirely attributable to the basic validity of
the data generated under the various conditions. In fact, in this
particular case there is no effect for Mean Discrimination in the ancova.

Table 11

Effects of Difficulty Parameters
on Validity^a

    Mean            Mean Difficulty           Mean            S. D. Difficulty
Discrimination        0       .5          Discrimination       1.0      2.0

     1.0            .750     .734              1.0            .750     .600
     2.0            .818     .822              2.0            .818     .854

F_DISC = 7.03;  F_DIFF = .04              F_DISC = 22.84;  F_DIFF = 2.86

^a Items = 15; Persons = 25; Chance Probability = 0;
S. D. Discrimination = 0.


Table 12

Central Processing Time in Seconds
by Items and Persons

               Persons
Items        25        40

 15         31.3      59.5
 25         94.7     143.9


Computer  Time 

A real-time system is only useful if it can operate in real time,
and a computerized testing system cannot be too costly if it is to be
adopted. The amount of central processing unit (CPU) time used in each
computer run was recorded as part of the operation of the program. A few
conditions that were anomalous for technical programming reasons were
deleted, and the average CPU time for each of the main combinations of
Persons and Items was computed. These are given in Table 12, where it is
apparent that both have a substantial effect, particularly Items. Pro-rated
across persons, this indicates that about four seconds of CPU time is
expended per subject with a pool of 25 items. This is admittedly on a
highly efficient 370/158 installation, but at charges which are currently
about five cents per CPU second, computing costs do not seem to be a major
factor. It should be noted as well that a good part of the computer time
went for overhead routines which were used to monitor the process and would
not be included in an operational version of the program. Furthermore, we
foresee substantial increases in program efficiency in the near future, and
computing costs seem to be continuing their historic decline, rather than
leveling off. Therefore, we do not foresee computing considerations being
a major factor with item pools of substantially larger size.
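
The pro-rating described above is simple arithmetic on the entries of
Table 12. The following sketch spells it out, taking the rate of five cents
per CPU second from the text; the dictionary layout and function name are
illustrative only.

```python
# Mean CPU seconds per run from Table 12, keyed by (items, persons).
cpu_seconds = {
    (15, 25): 31.3,
    (15, 40): 59.5,
    (25, 25): 94.7,
    (25, 40): 143.9,
}

COST_PER_CPU_SECOND = 0.05  # dollars, "about five cents per CPU second"


def per_subject(items, persons):
    """CPU seconds and approximate cost for one subject in a given cell."""
    seconds = cpu_seconds[(items, persons)] / persons
    return seconds, seconds * COST_PER_CPU_SECOND


for items, persons in sorted(cpu_seconds):
    seconds, cost = per_subject(items, persons)
    print(f"{items} items, {persons} persons: "
          f"{seconds:.2f} s per subject, about ${cost:.2f}")
```

For the 25-item pool both cells come out a little under four seconds and
under twenty cents per subject, in line with the figures given in the text.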

Summary of Results of the Monte Carlo Study

It appears that under a variety of circumstances the TAILOR procedure
works quite well. Without any pretesting, it arrives at a reasonable
approximation to the total score on a test, using about half the items.
Moreover, the percentage decreases with the number of persons and items
without there being a concomitant loss in validity. The major determinant
of the validity of the Tailored score is the validity of the item responses
on which it is based. It is somewhat more sensitive to influences on
consistency such as Mean Discrimination and Chance than the total score is.
It is relatively robust with respect to variations in a variety of
parameters, and computationally efficient enough for practical use,
provided the items are of the levels of quality used here.

Data Bank Simulation


Data Source

Responses from human subjects were also used in simulation studies
of TAILOR in order to see if the results generated from the Birnbaum model
(Birnbaum, 1968) would carry over to human responses. The approach here
was to make use of a data bank consisting of the complete item response
matrix for a large sample of persons. Specifically, the data were a file
of responses of 622 children to the 122 items of the Stanford-Binet, which
was made available through the kind offices of Dr. Mark Reckase of the
University of Missouri.

The  children  ranged  in  age  from  24  months  to  178  months  with  a mean 
of 93.7 and a standard deviation of 40.6 months, with a more or less uniform
distribution.  The  mean  IQ  was  117.3  with  a standard  deviation  of  17.6  and 
a range  from  66  to  166.  Thus,  the  sample  was  well  above  average  but  quite 
variable  in  IQ. 

The total sample was first divided into four age ranges (in months):
24 to 59, 60-95, 96-131, and 132-179. The three younger groups represent
three-year spans and the oldest is a four-year one. This was done in order
to reduce the variance in ability. The 96 to 131 age group was felt to be
unlikely to provide additional information and was not used in any of the
simulations. Within each age group, item statistics were calculated, i.e.,
proportion correct and the discrimination parameter (Urry, 1974). Item
pools were formed for each age group on the basis of the item difficulty by
deleting any subtest whose items were all of 1.0 or 0.0 difficulty. Of the
122 items on the total test, this left 54, 72, and 74 in the respective age
groups. These characteristics of the age groups and item pools are given in
Table 13.


Table 13

Characteristics of Age-Group
Binet Data

  Age              Mean Dis-        Binet          Number     Mean Dif-   S. D. Dif-
(years)     N     crimination       Levels        of Items     ficulty     ficulty

 2-4       156        .70        2-6 to 8-0          54          .095        1.84
 5-7       179       1.71        4-0 to 14-0         72         -.287        1.84
11-14      130       1.26        7-0 to Ad. 3        74          .562        1.59

 3          64                   2-6 to 8-0          54
 6          62                   4-0 to 14-0         72
13-14       60                   7-0 to Ad. 3        74

a Includes 2 items with zero variance, not used to calculate mean, S. D.
b Includes 1 item with zero variance, not used to calculate mean, S. D.
The "testing" situation thus at least approximates to some degree one
where children of a broad but limited age range are given a test whose
items are of a broad difficulty range but are not completely inappropriate.
The mean discrimination indices are of interest. For the two older groups,
their values seem to approximate the 1.0 to 2.0 values that were the main
ones used in the Monte Carlo studies, while for the youngest group they
average below 1.0. The main difference from the Monte Carlo is perhaps
that there is more variance in item difficulty here.

In addition to these three groups, three others that were more
homogeneous with respect to age were used. One of these was all the
three-year-olds, one was the six-year-olds, and one was the thirteen- and
fourteen-year-olds. These too are listed in Table 13, but it should be
noted that the information on items was not separately derived for them
(samples were too small). Curtailing the range in age will further reduce
ability variance, so in effect the discrimination of the items is reduced
relative to the values in the larger, more variable groups.

The major purpose here is to observe the operation of TAILOR with
data from human subjects. The operating characteristics such as average
proportion of items used are of interest, but it would be surprising if
they were very different from those for comparable numbers of subjects
and items from the Monte Carlo study described earlier. The primary
interest is in the effectiveness of the Tailored score. This was measured
by its validity against true score in the Monte Carlo, but here there is
no true score. In its absence, a parallel form reliability is the most
logical index of quality, so it will be the variable of major interest.

Method

Each age group was treated separately, so the basic data source was
the matrix of item responses appropriate to it. For example, in the 2-4
group, this was the responses of the 156 members of the group to the 54
items that were in the appropriate subtests of the Stanford-Binet. Sample
sizes of 20 and 40 were used, always with 25 items. There were five
replications for each age group and each sample size. In each replication
a random sample of persons was selected, as were two non-overlapping
samples of 25 items. One sample of items in a replication was used only for
the calculation of a total score. The other was used as a source of item
responses for TAILOR, which operated in the same way as in the Monte Carlo
investigations, using these as the response matrices. This procedure gave
a Tailored and a Complete score for each person, each based on
non-overlapping random samples of items from the same pool. Pearson and tau
correlations were computed between Complete and Tailored scores and the
correlations were averaged over the five replications. In addition, average
correlations between Complete test scores on random subsets of 25 items
were computed so that the Complete-Tailor correlations could be compared to
them. Thus there were two parallel-form reliabilities, a Complete-Tailor
and a Complete-Complete.
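
The sampling design for one replication can be sketched as follows. This is
only an outline of the bookkeeping: the response matrix is filled with
random 0/1 values as a stand-in for the Binet data, the pool sizes are
those quoted for the 2-4 group, and TAILOR itself is not reproduced. All
names are hypothetical.

```python
import random


def draw_replication(n_pool_persons, n_pool_items, n_persons,
                     n_items=25, rng=None):
    """Pick a random sample of persons and two non-overlapping samples of
    items: one for the Complete score, one to feed the tailoring program."""
    rng = rng or random.Random()
    persons = rng.sample(range(n_pool_persons), n_persons)
    items = rng.sample(range(n_pool_items), 2 * n_items)
    return persons, items[:n_items], items[n_items:]


def complete_scores(responses, persons, items):
    """Number-correct score of each sampled person on the given items."""
    return [sum(responses[p][i] for i in items) for p in persons]


rng = random.Random(3)
# Placeholder 156 x 54 matrix of 0/1 responses (not real data).
responses = [[rng.randint(0, 1) for _ in range(54)] for _ in range(156)]

persons, complete_items, tailor_items = draw_replication(
    156, 54, n_persons=20, rng=rng)
scores = complete_scores(responses, persons, complete_items)
print(len(persons), len(complete_items), len(tailor_items))
```

Drawing all 50 items in a single call and then splitting them guarantees
that the two item samples do not overlap; the Complete-Complete reliability
would use the same split with a number-correct score on both halves.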

Results 

TAILOR behaved very similarly to the Monte Carlo, as far as apparent
mode of operation of the program and number of responses required were
concerned. Table 14 gives the average proportion of responses required
of each subject for each of the conditions. These are very similar to
those for the most similar conditions in the Monte Carlo, perhaps very
slightly higher. The indication is that TAILOR completes the score matrix
on the basis of responses to about half of the 25 items, slightly more if
there are 20 persons, somewhat less if there are 40.

Table 14

Proportion of Items Used by TAILOR

                Wide Age Range                          Narrow Age Range
              Age Range (Years)                         Age Range (Years)
           2-4     5-7    11-14    Mean              3       6     13-14    Mean

n = 20    .540    .560    .556    .552     n = 20   .540    .560    .564    .555
n = 40    .444    .452    .468    .455     n = 40   .464    .488    .488    .480


Table 15

Parallel Form Correlation (tau)
Complete and Tailored Tests

I. Wide Age Range

                             20 Persons                       40 Persons
                          Age Range (Years)                Age Range (Years)
                      2-4     5-7    11-14    Mean      2-4     5-7    11-14    Mean

Complete-Tailor      .692    .735    .695    .707      .726    .708    .724    .719
Complete-Complete    .717    .741    .739    .732      .760    .759    .765    .761

II. Narrow Age Range

                             20 Persons                       40 Persons
                          Age Range (Years)                Age Range (Years)
                       3       6     13-14    Mean       3       6     13-14    Mean

Complete-Tailor      .565    .616    .668    .616      .583    .601    .667    .617
Complete-Complete    .646    .652    .733    .677      .636    .664    .727    .676

Table 15 gives the parallel form correlations for Complete-Tailor and
Complete-Complete scores. In the wide age range groups, these are very
similar, the taus averaging .713 vs. .747 overall. The difference is
somewhat larger in the narrow age ranges, .616 vs. .676. There is no
consistent effect for either Age Range or Persons in the Wide Age Range
data, but there does seem to be one in the Narrow Age Range. However,
overall it is clear that there is a close correspondence between
Complete-Tailor (C-T) and Complete-Complete (C-C) reliabilities.

This is true in a correlational sense as well as in an overall
correspondence of averages, as is displayed graphically in Figure 4, where
C-T is plotted as a function of C-C for the 12 means of Table 15. There
is clearly a high correlation (.95), and again a slope greater than unity
(1.17). As in the Monte Carlo data, the major influence on Tailored
reliability is the consistency of the basic data, and the influence is
disproportionate; that is, a given change in complete test reliability will
produce an even larger change in Tailored reliability, just as was true
for validity in the Monte Carlo.
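
The correlation and slope quoted above can be checked directly against the
12 cell means of Table 15. The sketch below does so with an ordinary
least-squares regression of C-T on C-C; it assumes that table entries
printed without a leading decimal point (such as 735) are to be read as
.735, and the function name is illustrative.

```python
from statistics import mean

# Cell means from Table 15 (tau), in the order Wide 20, Wide 40,
# Narrow 20, Narrow 40, three age groups each.
c_c = [0.717, 0.741, 0.739, 0.760, 0.759, 0.765,
       0.646, 0.652, 0.733, 0.636, 0.664, 0.727]
c_t = [0.692, 0.735, 0.695, 0.726, 0.708, 0.724,
       0.565, 0.616, 0.668, 0.583, 0.601, 0.667]


def slope_and_r(x, y):
    """Least-squares slope of y on x and the product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


slope, r = slope_and_r(c_c, c_t)
print(f"slope = {slope:.2f}, r = {r:.2f}")
```

With the table read in this way, the computed values should land close to
the reported slope of 1.17 and correlation of .95.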

In the Monte Carlo data, it was found that the two manipulated factors,
Mean Discrimination and Chance Probability, had an effect over and above
the effect of the validity of the complete test. A similar effect is
apparent here in that analysis of covariance shows significant intercept
differences between the Wide and Narrow range data, although the
differences are not large. Note, for example, that the two Narrow range
points with C-C reliabilities above .7 fall below the cluster of points
representing the Wide range.

Figure 4

Complete test-tailored test reliability and Complete
test-complete test reliability for 12 Binet studies


The apparent progression in reliability as a function of age in the
Narrow group (Table 16) may represent a similar phenomenon. That is, in
the Narrow age range, it is likely that the mean discrimination is lowest
in the youngest children and greatest in the oldest. The mean
discrimination in the total group of two to four year olds was .70, compared
to 1.71 and 1.26 for the older groups, so among the narrower groups it can
also be expected to be lowest for the three year olds in comparison to the
six year olds and the thirteen and fourteen year olds. The fact that the
latter is a two year span, compared to the single year for the six year olds,
may account for the fact that consistency was higher here in spite of the
lower average discrimination in the older group. These relations are
peripheral to the main interest, which is the comparison of the reliability
of the Tailor score to that for the Complete.

A statistical comparison was made with the Binet data which was not
made in the Monte Carlo study. This had to do with the accuracy of TAILOR
in reconstructing the complete score matrix, rather than comparison with
a parallel measure. At the conclusion of a TAILOR session, it has a
predicted complete score matrix composed in part of the person's actual
responses and in part of the responses which are deduced by TAILOR from them.
The latter can be compared to the actual responses to these items, and this
was done with these data.

The results are summarized in Table 16. An average of 96 per cent of
the unused responses were predicted correctly by TAILOR. There is very
little variation in the percentages as a function of age group or sample
size. Since there are typically 10 to 15 items which an individual does
not take, this means that in the majority of persons all of the responses
are predicted correctly and in almost all of the remainder one or two are
missed by TAILOR. These few errors are apparently sufficient to drop the
correlation with a parallel form by a few points. It is clear, however,
that TAILOR is quite accurate at predicting the outcomes of giving items.

Table 16

Proportions of Responses
Correctly Predicted by TAILOR

          Wide Age Range (Years)           Narrow Age Range (Years)
          2-4    5-7    11-14   Mean       3      6      13-14   Mean
n = 20    .950   .977   .960    .962       .933   .977   .940    .950
n = 40    .957   .956   .951    .955       .951   .969   .953    .958
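As a rough check on this reasoning, the sketch below computes the chance that every unused response is predicted correctly for a person who skips 10 or 15 items. It assumes a uniform 96 per cent accuracy and independent errors across items; both are simplifying assumptions made here for illustration, not something the report states, and the function names are hypothetical.

```python
def prob_all_correct(accuracy, n_unused):
    """Chance that all n_unused predictions are right, assuming independent errors."""
    return accuracy ** n_unused


def expected_misses(accuracy, n_unused):
    """Expected number of wrongly predicted responses per person."""
    return (1.0 - accuracy) * n_unused


for n_unused in (10, 15):
    p = prob_all_correct(0.96, n_unused)
    m = expected_misses(0.96, n_unused)
    print(f"{n_unused} unused items: P(all correct) = {p:.3f}, expected misses = {m:.2f}")
```

Under these assumptions somewhat more than half of the persons would have no prediction errors at all, and the expected number of misses per person is well below one, which is in line with the statement above.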

Summary  of  Binet  Simulation 

The Binet data behaved very similarly to the Monte Carlo simulation.
The item statistics indicated that the Stanford-Binet items were grossly
similar to the values used in the Monte Carlo, and the proportions of
responses used were also similar. Given the difference that the main
dependent variable here was a parallel form correlation rather than a
validity, the results of comparison of Complete to Tailored correlations
were quite comparable in the two studies. Tailored reliabilities average
within a few points of Complete ones, and show similar effects for the
consistency of the basic data matrices. Moreover, TAILOR is highly accurate
at reproducing the actual responses to items which it does not directly
observe.

Discussion 

Reported herein are the results from two extensive studies using TAILOR
with artificial data. In the first, the Monte Carlo investigation, 3800
"persons" and 2750 "items" were generated, and the mean ratio of the
validity against true score for tailored data to that for complete data was
.933. Using the Binet data bank, 1800 "persons" taking 1500 "items" were
simulated, and the ratio of alternate forms reliability with one complete
test and a second tailored test (r = .665) to alternate forms reliability
with complete data (r = .712) was very similar, .934. This is an
interesting finding, for no simulated person or item carried any prior
information about its ability or difficulty. The usual requirement of
pre-testing, which for stable estimates requires substantial samples, has
been circumvented. More important, the accuracy of the test results was
high. The implications of these findings in a tailored testing context are
discussed below.

Efficiency  of  TAILOR 

How much does a tailored test save? One answer to this can be found
by comparing the reliability of a tailored test to one which is simply
shortened to an equivalent length. Alternatively, given the reliability
of tailored and complete tests, one can solve the Spearman-Brown formula
(Lord and Novick, 1968, p. 112) for the length factor and compare it to
the actual proportion of items asked in the tailored version.
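The second approach can be sketched as follows. The standard Spearman-Brown formula gives the reliability r' of a test whose length is changed by a factor k as r' = kr / (1 + (k - 1)r), and rearranging gives k = r'(1 - r) / (r(1 - r')). The function names and the input reliabilities below are illustrative assumptions, not values from the report.

```python
def spearman_brown(r, k):
    """Reliability of a test whose length is multiplied by k, given reliability r."""
    return k * r / (1.0 + (k - 1.0) * r)


def length_factor(r_complete, r_tailored):
    """Relative length at which the complete test would have reliability r_tailored."""
    return r_tailored * (1.0 - r_complete) / (r_complete * (1.0 - r_tailored))


# Illustrative (hypothetical) reliabilities for a complete and a tailored test.
k = length_factor(0.90, 0.80)
print(f"equivalent length factor: {k:.3f}")
print(f"check, reliability at that length: {spearman_brown(0.90, k):.3f}")
```

If the tailored test asks a smaller proportion of the items than the k obtained this way, it is doing better than a test that is simply shortened.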

The latter was done for the Monte Carlo data, starting by squaring
the validities to get a reliability estimate. The Pearson correlations were
used for this purpose. The most representative case is the data for 25
items and 40 persons, with discriminations of 1.0 and 2.0, conditions 25
and 26 of Table 3, where Tailored validities are .920 and .940, respectively,
while the Complete validities are .956 and .965. The relation of .965 to
.940 corresponds to using a test 78.4 per cent as long, whereas in actuality
only 44.1 per cent of the responses were used by TAILOR. A test that
requires only an average of about 11 responses is acting like a complete test
with nearly 20. The corresponding data with discriminations of 1.0 are not
as favorable, but are still encouraging. Here, the validities of .956 and
.920 correspond to using a test 72.6 per cent as long, while in fact TAILOR
used 47.1 per cent of the possible responses. Here, 12 responses are acting
like an 18-item test. More exactly, the ratio of the length estimated from
reliability to the proportion of responses actually used is 1.778 in the
case of the 2.0 discrimination item pool and 1.541 in the case of 1.0
discrimination.
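The arithmetic behind these figures can be laid out in a short sketch. It takes the equivalent length factors (.784 and .726) and the proportions of responses used (.441 and .471) as given in the text, with a pool of 25 items; the function name is a hypothetical one chosen for illustration.

```python
def efficiency_summary(length_factor, proportion_used, pool_size):
    """Return (efficiency ratio, responses actually used, equivalent test length)."""
    ratio = length_factor / proportion_used
    return ratio, proportion_used * pool_size, length_factor * pool_size


cases = {
    "discrimination 2.0": (0.784, 0.441),
    "discrimination 1.0": (0.726, 0.471),
}
for label, (k, used) in cases.items():
    ratio, n_used, n_equiv = efficiency_summary(k, used, pool_size=25)
    print(f"{label}: {n_used:.1f} responses act like {n_equiv:.1f} items, ratio {ratio:.3f}")
```

The ratios agree with the quoted 1.778 and 1.541 to within rounding of the inputs.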

The corresponding calculations for the smaller item pools are not as
favorable. For 15 items, the ratios of lengths estimated from reliabilities
to the proportions of responses used are 1.612 for 2.0 discrimination and
1.257 for 1.0, primarily because of the larger proportion of items that are
used.

The relation between the results for 25 and 15 items suggests that the
savings for item pools of a more realistic size will be even more substantial.
That is, the process becomes relatively more efficient as the number of items
increases because the items that are used are on the average closer to the
person's ability level.

Parallel calculations were done for the Binet data. Here, though, the
correlations are between a tailored test and a complete test, analogous to
correlations between a long form and a short form, rather than between forms
of equal length. This requires an adaptation of the Spearman-Brown formula,
which becomes

$$ r' = \frac{k^{1/2}\, r}{\left[(k - 1)r + 1\right]^{1/2}} $$

where r is the correlation between two parallel forms of equal length and
r' is the expected correlation of one of them with a measure of the same
ability which is of an altered length, k times the length of the first pair.
Given the correlations as here, the formula may be solved for k. This gives
us a figure which can be compared to the proportion of items actually used,
on the average, by TAILOR, as was done more simply with the Monte Carlo
results.

The relevant data are given in Table 17: the proportion of items used,
the Complete-Tailor and Complete-Complete correlations, the k values from
the formula above, and the ratio of k to the proportion of items.
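A minimal sketch of this calculation follows. It assumes the adapted formula has the form r' = r sqrt(k) / sqrt((k - 1)r + 1), which is the form that reproduces the tabled k values; squaring and rearranging then gives k = r'^2 (1 - r) / (r (r - r'^2)). The function names are hypothetical, and the inputs are the proportions and correlations tabled for the Binet data.

```python
import math


def cross_length_correlation(r, k):
    """Expected correlation between a unit-length form and a form k times as long."""
    return r * math.sqrt(k) / math.sqrt((k - 1.0) * r + 1.0)


def solve_length_factor(r, r_prime):
    """Invert cross_length_correlation for k, given r (C-C) and r_prime (C-T)."""
    return r_prime ** 2 * (1.0 - r) / (r * (r - r_prime ** 2))


# (label, proportion of responses used, C-T r, C-C r) as tabled for the Binet data
binet = [
    ("20 persons, Wide", 0.552, 0.829, 0.855),
    ("40 persons, Wide", 0.455, 0.866, 0.889),
    ("20 persons, Narrow", 0.555, 0.765, 0.811),
    ("40 persons, Narrow", 0.480, 0.750, 0.805),
]

for label, used, r_ct, r_cc in binet:
    k = solve_length_factor(r_cc, r_ct)
    print(f"{label}: k = {k:.3f}, k / proportion used = {k / used:.3f}")
```

Because the inputs are rounded to three places, the ratios printed here can differ from the tabled ones in the last digit.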

The results show that there is an appreciable gain in efficiency in
the Wide age range, but only a small one in the Narrow. These results are
similar to, but not as favorable as, those for the Monte Carlo. The
efficiency ratio of 1.479 for TAILOR using 25 items and 40 persons is close
to the 1.541 for 25 items of 1.0 discrimination derived from the Monte Carlo.
There may be further increases in efficiency with a larger item pool, but on
the other hand the Binet is an extremely discriminating test.

Efficiency Calculations for
Binet Data

                          20 Persons   40 Persons   20 Persons   40 Persons
                          Wide         Wide         Narrow       Narrow
Proportion of responses   .552         .455         .555         .480
C-T r                     .829         .866         .765         .750
C-C r                     .855         .889         .811         .805
k                         .695         .673         .604         .562
Ratio k/proportion        1.259        1.479        1.088        1.171

Assessment Procedures

As is true of all areas of research, the assessment of tailored testing
procedures is subject to bias and confounding. An attempt has been
made here to make a realistic assessment of TAILOR. The Monte Carlo
correlations with true score, based on results from a sample of items from a
specified population, parallel the use of TAILOR on a sample of persons
with items which should be of appropriate average difficulty but may not
be exactly so. The items' individual characteristics are unknown. TAILOR
was compared to just using a random fraction of the items for all persons,
and found to be appreciably more efficient, particularly with highly
discriminating items.

Note that questions that can be raised about tailoring procedures based
on pretest item statistics, questions concerning the sampling errors in
the statistics and possible population differences between pretesting and
tailored populations, are not relevant here. Also, there is no hidden
cost of pretesting. After all, if a test must be standardized on 1,000
subjects so that a second thousand need only take half the items, the average
person has actually taken three-fourths of the items, not half.
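The three-fourths figure is a weighted average over both groups, as the following small sketch shows. The function name is hypothetical; the numbers are those of the example above.

```python
def average_fraction_taken(n_pretest, n_tailored, tailored_fraction):
    """Mean share of the item pool answered per person, pretest group included."""
    # Pretest subjects answer every item; tailored subjects answer only a fraction.
    total_share = n_pretest * 1.0 + n_tailored * tailored_fraction
    return total_share / (n_pretest + n_tailored)


print(average_fraction_taken(1000, 1000, 0.5))  # 0.75, i.e. three-fourths
```

The hidden cost shrinks only as the tailored group becomes much larger than the pretest group.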

In the Binet data, where there is no true score, the correlational
design furnished information on the efficiency of TAILOR using parallel
measures. The comparison to the complete test data is confounded with
differences in the item pools for tailored and conventional tests. The
estimates of the accuracy with which the partial response patterns on
TAILOR can reproduce the complete score matrix are not biased by the
inclusion of the responses themselves.

Theoretical  Considerations 

The ordinal model that is the basis for TAILOR is a rather different
approach to test theory. One of its advantages is that it treats items
and persons symmetrically, making clear that many of the things that
characterize items also characterize persons, and vice versa. This is
particularly relevant to tailored testing.

Conventional methods of tailored testing necessarily rely on accurate
estimates of item difficulty and discrimination. Giving items to large
pretesting samples means that very often an item must be given to a person for
whom it is inappropriate. That is, there is an information function for
items as well as persons, and when a high difficulty item is given to a
low ability person, or vice versa, one learns little about either the item
or the person.

In effect, the implied order system which is the basis for TAILOR tries
to get around this by matching person to item as well as item to person.
Consequently, the amount that it learns per response is likely to be greater
than for other systems, provided the pretesting phase of the latter is
included in the evaluation. Rather amazingly, it seems to be possible to
make an appreciable saving on the basis of a rather small amount of data.

Applications 

The effectiveness of TAILOR depends on having items which are of high
discrimination. This tends to be true of all tailored testing schemes. If
discriminations are only moderate, there is not a great deal more
information provided by items near the person's ability level than by those
far from it. Moreover, it takes more items to focus accurately on the
ability level.
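The point about discrimination can be illustrated with the usual item information function. The sketch below assumes a two-parameter logistic item with scaling constant 1.7 and ignores guessing; that model, the discrimination values 0.7 and 2.0, and the distance of 1.5 ability units are assumptions chosen here for illustration, not computations from the report.

```python
import math


def item_information(theta, a, b, scale=1.7):
    """Information of a two-parameter logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))
    return (scale * a) ** 2 * p * (1.0 - p)


def on_target_advantage(a, distance):
    """Ratio of information at theta == b to information `distance` units away."""
    return item_information(0.0, a, 0.0) / item_information(distance, a, 0.0)


for a in (0.7, 2.0):
    print(f"a = {a}: on-target item is {on_target_advantage(a, 1.5):.1f} times as informative")
```

With the moderate discrimination, an item matched to the person is only about twice as informative as one 1.5 units away; with the high discrimination the advantage is about fortyfold, which is why matching pays off mainly for highly discriminating items.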

Item parameters are always expressed relative to the standard deviation
of ability. Thus, an item does not have an intrinsic discrimination; it
depends on the population. Tailored testing is therefore likely to be most
applicable to situations where variability is greatest. This seems likely
to be in placement situations and those in which training is being assessed.
It is here that one is likely to find high variance, primarily as a function
of whether training has been given or not. That is, there may well be a
fairly definite joint order of persons and items.

TAILOR, or some descendant of it, may well be useful in such contexts,
particularly where large-scale testing is infeasible. This may well be
true more generally if the strengths and weaknesses of it and other
systems are realistically assessed. This will be particularly true if it is
found, as seems likely, that persons behave systematically differently in
computerized-tailored and conventional testing situations.
Reference Note


Cudeck, R., Cliff, N., Reynolds, T., and McCormick, D. Monte Carlo
results from a computer program for tailored testing. Technical
Report No. 2, Department of Psychology, University of Southern
California, 1976.
Appendix


Monte Carlo Data Generation


The data used during the Monte Carlo studies were produced at the
beginning of each run in the following way. For simplicity only a general
description is provided. Assume k persons and n items.


Step 1: Three vectors of length n and one of length k were
        produced which contained true scores for item
        discrimination, difficulty and guessing and the person
        ability score. These were normally distributed
        with a prescribed mean and standard deviation, given by

$$ c_i = SD \left( \sum_{m=1}^{12} UNI_m - 6.0 \right) + X $$

        where

        c_i = the normally distributed item or person characteristic

        SD  = the standard deviation

        UNI = a vector of uniform random numbers in the range
              of 0,1 based on a function available at USC's
              computer center

        X   = the desired mean


Step 2: A matrix P of order k by n was computed, and contained the
        probabilities of a person answering an item correctly. As
        usual

        a_j      = item discrimination
        b_j      = item difficulty
        c_j      = probability of chance success
        \theta_k = person ability

        and

$$ P_{kj} = c_j + (1 - c_j) \frac{1}{1 + \exp\left[-1.7\, a_j (\theta_k - b_j)\right]} $$

Step 3: A matrix S of order k by n was generated, which is
        the score matrix based on P and a uniform random
        number, where

$$ S_{kj} = \begin{cases} 1 & \text{if } UNI < P_{kj} \\ 0 & \text{otherwise} \end{cases} $$

        and UNI is as given above; 1 and 0 indicate correct
        and incorrect responses.
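The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original program: the function names, the default means and standard deviations, the seed, and the clamping of the chance parameter to the range 0 to 1 are all choices made here, and Python's `random` module stands in for the uniform generator at USC's computer center.

```python
import math
import random


def approx_normal(rng, mean, sd):
    """Step 1: the sum of 12 uniform(0, 1) draws minus 6.0 is roughly N(0, 1)."""
    return sd * (sum(rng.random() for _ in range(12)) - 6.0) + mean


def simulate(n_persons, n_items, rng,
             disc=(1.0, 0.2), diff=(0.0, 1.0),
             chance=(0.2, 0.05), ability=(0.0, 1.0)):
    """Return (P, S): success probabilities and 0/1 scores, persons by items."""
    a = [approx_normal(rng, *disc) for _ in range(n_items)]
    b = [approx_normal(rng, *diff) for _ in range(n_items)]
    # Clamping chance success to [0, 1] is an assumption added here.
    c = [min(1.0, max(0.0, approx_normal(rng, *chance))) for _ in range(n_items)]
    theta = [approx_normal(rng, *ability) for _ in range(n_persons)]

    # Step 2: three-parameter logistic probability of a correct response.
    P = [[c[j] + (1.0 - c[j]) / (1.0 + math.exp(-1.7 * a[j] * (theta[k] - b[j])))
          for j in range(n_items)] for k in range(n_persons)]
    # Step 3: a response is correct when a fresh uniform draw falls below P.
    S = [[1 if rng.random() < P[k][j] else 0 for j in range(n_items)]
         for k in range(n_persons)]
    return P, S


rng = random.Random(1976)  # arbitrary seed for repeatability
P, S = simulate(n_persons=40, n_items=25, rng=rng)
print(sum(map(sum, S)) / (40 * 25))  # overall proportion correct
```

The sum-of-twelve-uniforms device has mean 6 and variance 1 before centering, so it needs no square roots or logarithms; its draws can never fall more than six standard deviations from the mean.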
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